TripAdvisor is a website for travellers, where you can find reviews of hotels, restaurants, tourist destinations, and more. In this notebook we will use the large TripAdvisor Data Set of hotel reviews. We want to examine some features of this dataset, which will be useful for later model selection and sentiment analysis.
Besides well-known libraries like pandas, numpy, plotly, seaborn, and matplotlib, we will use the missingno and pickle libraries. Pickle is useful for storing dataframes on disk, and missingno lets us examine missing values in the raw data.
import pandas as pd
import numpy as np
import json
import os
import pickle
import missingno as msno
import plotly.offline as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
py.init_notebook_mode(connected=True)
%matplotlib inline
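The pickle round-trip mentioned above can be sketched on a toy dataframe (the file name `demo.pkl` is illustrative, not part of the dataset):

```python
import pandas as pd

# Minimal sketch of the pickle round-trip used in this notebook:
# write a small DataFrame to disk, then read it back unchanged.
df_demo = pd.DataFrame({"Overall": [5, 3], "Content": ["great", "ok"]})
df_demo.to_pickle("demo.pkl")
restored = pd.read_pickle("demo.pkl")
```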
IMPORTANT: execute the next cell only if the tripadv.pkl file is already on your disk.
df = pd.read_pickle('tripadv.pkl')
IMPORTANT: the next 3 cells should be executed only if the previous cell did not run successfully. Otherwise, skip them.
In these cells we locate the folder with the dataset. Each hotel's data is stored in a separate JSON file, so we extract the needed information from all these files and build a pandas dataframe. The code in the second cell does this; note that the extraction takes quite a long time to finish.
Also, if you want to run the next 3 cells, place the json folder in the directory from which the notebook was started.
dir_path = os.path.join(os.path.dirname(os.path.realpath('Untitled.ipynb')), 'json')
print(dir_path)
directory = os.fsencode(dir_path)
files = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        files.append(filename)
print(len(files))
df = []
counter = 0
no_data = 0
for file in os.listdir(directory):
    if counter % 1000 == 0:
        print(len(df))
    counter += 1
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        with open(os.path.join('json', filename)) as json_data:
            d = json.load(json_data)
        # Build one small dataframe per review, then concatenate per hotel
        df_hotel = []
        for i in range(len(d['Reviews'])):
            trial_df = pd.DataFrame(d['Reviews'][i]['Ratings'], index=[0])
            trial_df['Content'] = d['Reviews'][i]['Content']
            df_hotel.append(trial_df)
        try:
            df_hotel = pd.concat(df_hotel)
        except ValueError:  # no reviews in this file
            no_data += 1
        if len(df_hotel) > 0:
            df.append(df_hotel)
df = pd.concat(df)
print()
print(no_data)
print(df.shape)
df.to_pickle('tripadv.pkl')
Let's look at our dataframe. We have 11 columns and more than 1.6 million rows. The Content column holds the text reviews; all the other columns are ratings for different aspects of a stay. However, there are actually two duplicate columns containing business service ratings, so we should merge them.
print(df.shape)
df.head()
Below we can see that there are only 158k non-missing entries in the second column and 145k non-missing entries in the first. All other entries are missing.
print(df['Business service (e.g., internet access)'].dropna().shape)
print(df['Business service'].dropna().shape)
In the next 4 cells we merge the two columns mentioned above, check the new shape of the column (it now has more non-missing entries), delete the redundant column (all the needed information has already been copied into the other one), and look at the head of the updated dataframe:
df['Business service'] = df['Business service'].fillna(df['Business service (e.g., internet access)'])
print(df['Business service'].dropna().shape)
df = df.drop(['Business service (e.g., internet access)'], axis=1)
df.head()
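The merge pattern used above can be illustrated on a toy dataframe (the values here are made up for the illustration):

```python
import numpy as np
import pandas as pd

# Values from the duplicate column fill the gaps in the main one,
# then the duplicate column is dropped.
toy = pd.DataFrame({
    "Business service": [4.0, np.nan, np.nan],
    "Business service (e.g., internet access)": [np.nan, 5.0, np.nan],
})
toy["Business service"] = toy["Business service"].fillna(
    toy["Business service (e.g., internet access)"])
toy = toy.drop(["Business service (e.g., internet access)"], axis=1)
# The merged column now has two non-missing entries instead of one.
```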
Let's check the dataset for missing values. In the next cell you can see how many non-missing values we have. Only the Content and Overall columns have no missing values, while the Business service, Check in / front desk and Sleep Quality columns are missing more than half of their entries.
print('Check in / front desk: ', df['Check in / front desk'].dropna().shape)
print('Business service: ', df['Business service'].dropna().shape)
print('Cleanliness: ', df['Cleanliness'].dropna().shape)
print('Location: ', df['Location'].dropna().shape)
print('Rooms: ', df['Rooms'].dropna().shape)
print('Service: ', df['Service'].dropna().shape)
print('Sleep Quality: ', df['Sleep Quality'].dropna().shape)
print('Value: ', df['Value'].dropna().shape)
print('Overall: ', df['Overall'].dropna().shape)
print('Content: ', df['Content'].dropna().shape)
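The per-column counts above can also be obtained in a single call with `notna().sum()`; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Overall": [5, 4, 3],
    "Sleep Quality": [np.nan, 4, np.nan],
})
counts = toy.notna().sum()  # non-missing entries per column
print(counts)
```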
In the next cell we use the missingno library to visualize the presence of missing values across all columns and rows.
msno.matrix(df, color=(0.42, 0.1, 0.05))
In the next cell we look at the lengths of all reviews in the Content column and compute some statistics. The average review length is about 961 characters, with a large standard deviation of about 923 characters, and the longest review has more than 42 thousand characters. There also seem to be some very short reviews: the shortest has only 2 characters, which the mean and standard deviation also hint at.
When building a machine learning model for sentiment analysis, it can be useful to look at these very short reviews and discard those that carry no useful information.
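Such a filter can be sketched as follows; the 10-character threshold is an assumption chosen only for illustration:

```python
import pandas as pd

# Hypothetical filter: keep only reviews of at least 10 characters.
toy = pd.DataFrame({"Content": ["ok", "This hotel was wonderful", "!!"]})
filtered = toy[toy["Content"].str.len() >= 10]
```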
lens = df['Content'].str.len()
print(lens.mean(), lens.std(), lens.min(), lens.max())
Let's also build a histogram showing the distribution of review lengths. Most reviews are between 500 and 1000 characters long, and there are also many reviews in the 0-500 character range. Almost all reviews are shorter than 5 thousand characters.
data = [go.Histogram(x=lens, xbins=dict(start=0, end=43000, size=500), marker=dict(color='#8c42f4'))]
layout = go.Layout(
title='Length of reviews distribution',
xaxis=dict(
title='Length'
),
yaxis=dict(
title='Count'
),
bargap=0.1)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='length histogram')
Now we want to examine the distributions of ratings.
We start with the distribution of the Overall rating. The "0" rating has the smallest number of occurrences in the dataset, and the number of occurrences gradually increases from the "1" to the "5" rating.
plt.figure(figsize=(12,8))
sns.set(style="darkgrid", font_scale = 1.5)
b = sns.countplot(x='Overall', data = df.drop(['Content'], axis=1).astype(float))
b.axes.set_title('"Overall" rating distribution',fontsize=25)
b.set_xlabel("Rating",fontsize=20)
b.set_ylabel("Count",fontsize=20)
plt.figure(figsize=(14,8))
b = sns.countplot(x="value", hue="variable",
data=pd.melt(df.drop(['Content', 'Sleep Quality', 'Overall', 'Rooms', 'Location',
'Business service', 'Check in / front desk'], axis=1).astype(float)))
b.axes.set_title('"Cleanliness", "Service" and "Value" ratings distribution',fontsize=25)
b.set_xlabel("Rating",fontsize=20)
b.set_ylabel("Count",fontsize=20)
The previous histogram shows the distribution of the Value, Service and Cleanliness ratings. The general pattern of counts increasing with the rating is present here as well. We can also see something odd: a number of ratings with the value "-1". There are several possible explanations: data entry mistakes, missing values replaced by "-1", a different rating scale for some hotels or time periods, etc. When building a machine learning model it can be useful to experiment with these rows to understand their meaning.
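One simple experiment with those rows is to treat "-1" as missing; a sketch on a toy column (this is only one hypothesis about what "-1" means, not a claim about the data):

```python
import numpy as np
import pandas as pd

# Replace the "-1" sentinel with NaN so it no longer distorts statistics.
toy = pd.DataFrame({"Value": [5.0, -1.0, 4.0]})
toy["Value"] = toy["Value"].replace(-1.0, np.nan)
# The mean is now computed over valid ratings only.
```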
plt.figure(figsize=(14,8))
b = sns.countplot(x="value", hue="variable",
data=pd.melt(df.drop(['Content', 'Overall', 'Check in / front desk', 'Business service', 'Service',
'Cleanliness', 'Value'], axis=1).astype(float)))
b.axes.set_title('"Location", "Sleep Quality" and "Rooms" ratings distribution',fontsize=25)
b.set_xlabel("Rating",fontsize=20)
b.set_ylabel("Count",fontsize=20)
The previous chart shows the distribution of the Location, Sleep Quality and Rooms ratings. The same pattern of counts increasing with the rating is present here, and there are no "-1" ratings in the Sleep Quality category.
plt.figure(figsize=(14,8))
b = sns.countplot(x="value", hue="variable",
data=pd.melt(df.drop(['Content', 'Overall', 'Rooms', 'Location', 'Service', 'Sleep Quality',
'Cleanliness', 'Value'], axis=1).astype(float)))
b.axes.set_title('"Business service" and "Check in / front desk" ratings distribution',fontsize=25)
b.set_xlabel("Rating",fontsize=20)
b.set_ylabel("Count",fontsize=20)
The previous histogram shows the distributions of the Check in / front desk and Business service ratings. There are many "-1" values in both of these categories.
The last thing we want to do is compute and visualize the correlations between the ratings; the correlation matrix is shown below. There is a strong correlation between Overall and the Rooms, Service and Sleep Quality ratings. Surprisingly, the correlation between Location and Overall is low. Some values are expected: for example, the correlation between Cleanliness and Rooms is high.
colormap = plt.cm.magma
plt.figure(figsize=(16,10))
plt.title('Pearson correlation of ratings', y=1.05, size=15)
sns.heatmap(df.drop(['Content'], axis=1).astype(float).corr(),linewidths=0.01,vmax=1.0, square=True,
cmap=colormap, linecolor='white', annot=True)
In this notebook we performed exploratory data analysis (EDA) of the TripAdvisor Data Set. We examined the missing values, the shape of the dataframe, the target variable distributions, the review lengths, and the correlations between the target variables.
This EDA can help in building a model that predicts ratings from review texts (sentiment analysis).